A Note on Levenshtein Distance versus Human Analysis

Author

  • KAMIL STACHOWSKI
Abstract

This paper argues that automatic phonetic comparison will only return true results if the languages in question have similar and comparably lenient phonologies. Where their phonologies are incompatible and/or restrictive, linguistic knowledge of both of them is necessary to obtain results matching human perception. Whilst the case is mainly exemplified by Levenshtein distance and Russian loanwords in Dolgan, the conclusion is also applicable to the approach as a whole.

Studia Linguistica Universitatis Iagellonicae Cracoviensis vol. 128 (2011) © E. Mańczak-Wohlfeld & WUJ. This publication is protected by copyright. All rights reserved. Copying and distribution prohibited.

0. Rationale and introductory notes

In Stachowski (2010), I presented a method of quantifying the phonetic adaptation of loanwords, which depends heavily on prior human analysis. It has been suggested to me that the method would be more valuable if the requirement for an expert analyst to specify the adaptations ahead of time could be removed. This question leads directly to the problem of how much linguistic knowledge, or knowledge of the languages being analyzed, is necessary for the results of an automatized assay to correlate with human (native speakers’) perception.

Levenshtein distance (= edit distance) has more than once been shown to be capable of credible results (see e.g. Heeringa et al. 2006), even for genetically and typologically quite distant languages, such as Kipchak Turkic vs. Iranian in van der Ark et al. (2007). However, it seems that it is much more often applied to phonologically quite similar languages such as Dutch, English, German or Norwegian dialects. Moreover, most of these languages are phonotactically relatively rich and therefore lenient, which appears to be key here. This is not at all the case with Dolgan, the Russian loanwords in which I have attempted to analyze.
Neither is it the case with a great number of other languages, Turkic and otherwise, to which the method can in theory be applied.

Levenshtein distance has also been criticized for its crudity, resulting in the charge that it so completely misrepresents the nature of language (Heggarty 2006: 185). A number of refinements have been proposed, and also a number of other algorithms of varying degrees of advantageousness (see e.g. Heeringa et al. 2006 or Nerbonne, Heeringa 2009, etc.). Nevertheless, I have chosen to use the basic version of the method here for its popularity and simplicity, and because it represents quite well the crucial methodological assumptions common to at least the majority of proposals.

I will:

  1. present the results of contrasting Levenshtein distance with my index of nativization, as applied to Russian loanwords in Dolgan,
  2. provide some further and typologically different examples of the incompatibility of Levenshtein distance with human perception, and
  3. conclude in, hopefully, a positive way.

1. Russian loanwords in Dolgan

In Stachowski (2010), I calculated an index of nativization (= degree of adaptation) for each of the 1169 identified Russian loanwords in Dolgan. It ranges from 0 (not nativized) to 1 (fully nativized). Examples: Russ. aèropórt ‘airport’ > Dolg. aèroport id. (index 0), lódka ‘boat’ > lokka id. (0.50931), vétka ‘Siber. canoe’ > bǟkkä id. (1).

The leading assumption of this method is that adaptations which are more common contribute less to the final score than those which are rarer. This entails that adaptations need to be identified ahead of time, and it is here that the first obstacle arises. Recognizing some of the adaptations requires precisely a knowledge of Dolgan phonology. For example, the -dk- (= [-tk-]) > -kk- change observed in lokka above is not merely an assimilation but in fact an application of one of the Dolgan phonotactic rules which are obligatory in native words across morpheme boundaries.
These rules are sometimes also applied to loanwords, but this is very rare (only seven examples in the corpus of 1169 words). Hence the relatively high score of 0.5, although two out of three adaptations have not been applied here: a fully nativized shape would be *luokko or *lōkko.

Levenshtein distance is not bothered by the commonness of the given change. It measures the phonetic distance between two forms. Naturally, this requires precise phonetic transcriptions of both words in order to return valid results. This is the second obstacle. Detailed recordings are available for Dutch, English, German, Norwegian, etc. but are missing and much more difficult to obtain for lesser known and more distant languages such as Dolgan. What is more, extinct and reconstructed languages have to be automatically excluded, together with any borrowings which occurred from a dialectally mixed society, where the exact pronunciation is often impossible to establish. This happens to be the case with Russian in northeastern Siberia.

If one nevertheless decided to wade on, one would therefore be forced to measure phonological rather than phonetic distance. This allows further investigation but it makes a significant difference.

One difficulty is to decide which phonemes can be treated as corresponding, and which cannot. V does not occur in Dolgan except in loanwords. If these were considered Fremdwörter, Dolg. b could correlate with both Russ. b and v. Such a solution might seem to be an exaggeration at first but its fabricated feel quickly wanes as the number of obstacles of this type grows. This is the reason why I only provide approximate counts of examples below.
An exact number would require many methodological decisions, and discussing them would be beyond the scope of the current note.

On the other hand, a move to phonology brings the results closer to reality elsewhere. K and ḱ are allophonic in Russian in some positions, and so they are in Dolgan. In vétka ‘Siber. canoe’, the k is not palatalized whereas in bǟḱḱä id. both k’s are, and in both cases this is not phonemic. Adopting a phonological transcription will improve the Levenshtein distance by freeing it from incorporating an irrelevant difference in its result.

I calculated the Levenshtein distance for the entire corpus of Russian loanwords in Dolgan (with indel = sub = 1) and contrasted the results with my index. The correlation turned out to be 0.43, and this hardly came as a surprise. First and foremost, the methodological approach is dramatically different. Let us consider a few cases:

• Both measures closely match

This mostly happens when most of the possible adaptations have not been applied, or when very few adaptations are applicable and they were skipped. Both measures are 0 or close to it, and thus they match or almost match. In the case of Russian loanwords in Dolgan, such examples account for less than a fourth of the total number. Examples: Russ. patrón ‘cartridge’ > Dolg. patruon id. (index 0.04667), pártija ‘party’ > pārtija id. (0.02864), rabóčij ‘worker’ > rabočaj id. (0.11023), žurnál ‘journal’ > žurnāl id. (0.00828); Russ. čas ‘1. hour; 2. clock’, kak ‘since’, maj ‘May’, šar ‘i.a. balloon’ > Dolg. ≡ (indices 0).

A close match can also happen in other situations, in particular when the adaptations applied exhaust all the possibilities as completely as they change the phonetic shape. However, such examples account for less than an eleventh of the total number. Examples: Russ. blagosloví ‘may he bless’ > Dolg. lastabi id. (index 0.78626), Fëdor (given name) > Pǟdär id. (0.3666), vóvse ‘completely’ > buosa id. (0.44574), zdoróvьe ‘health’ > dorōbuja id. (0.55297).

• The two measures are almost opposite

This mostly happens when there are very few adaptations possible and they have all been applied, but without changing the word’s phonetic shape much. Such cases are very rare and account for less than a twenty-seventh of the total number. Examples: Russ. Ánna (given name) > Dolg. Ānna id., barán ‘fur jacket’ > barān id., pop ‘Orthodox priest’ > puop id., ukázka ‘pointer’ > ukāska id. (all indices 1).

• The two measures diverge randomly

This happens when the degree to which the applied adaptations exhaust all the possibilities does not coincide with their shape-changing power. One borderline case of this has already been mentioned above: -dk- > -kk- in lódka ‘boat’ is phonetically a minor assimilation but in Dolgan, it is a sign of far-reaching nativization. Óčeredь ‘order, sequence’ > uočarat id. (index 0.12155), on the other hand, is phonetically a considerable change but for Dolgan phonology, it is merely a combination of a fairly common substitution of a diphthong for a Russian accented vowel (unsurprisingly, especially often with ó), an even more common repair of vowel harmony, and an equally common removal of palatalization from t́, since such a sound does not exist in Dolgan, whereas t does. This is by far the most common case and it accounts for about two thirds of the total number. Examples: Russ. krovátь ‘bed’ > Dolg. kyrbat id. (index 0.99227), Oksínьja (given name) > Oksiäńńä id. (0.02864), séjanka (a kind of meat dish) > hiäŋki id. (0.2576), Vasílьevič (patronym) > Bahylajbys (0.99676).
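
The basic measure used in the comparison above (indel = sub = 1) can be sketched as follows. This is only an illustration, not the procedure actually used for the corpus: the function names, the treatment of every NFC code point as one segment, and the omission of Russian stress marks are all assumptions. The four word pairs and their indices are quoted from the examples above.

```python
import unicodedata


def segments(word):
    """Split a transcription into segments, one per NFC code point (an assumption)."""
    return list(unicodedata.normalize("NFC", word.lower()))


def levenshtein(source, target, indel=1, sub=1):
    """Edit distance between two segment sequences (row-by-row dynamic programming)."""
    previous = [j * indel for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [i * indel]
        for j, t in enumerate(target, start=1):
            cost = 0 if s == t else sub
            current.append(min(previous[j] + indel,       # delete s
                               current[j - 1] + indel,    # insert t
                               previous[j - 1] + cost))   # match or substitute
        previous = current
    return previous[-1]


# (Russian etymon, Dolgan loanword, index of nativization) as quoted in the text;
# stress marks on the Russian forms are left out (an assumption of this sketch).
EXAMPLES = [
    ("čas", "čas", 0.0),
    ("lodka", "lokka", 0.50931),
    ("anna", "ānna", 1.0),
    ("krovatь", "kyrbat", 0.99227),
]

for russian, dolgan, index in EXAMPLES:
    distance = levenshtein(segments(russian), segments(dolgan))
    print(f"{russian} > {dolgan}: distance {distance}, index {index}")
```

On these four pairs the distances are 0, 1, 1 and 4. The second and third pairs receive the same distance although their indices are about 0.5 and 1 respectively, which is the kind of mismatch described in the cases above.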
To conclude, Levenshtein distance applied to Russian loanwords in Dolgan will return a valid and true measurement of phonological difference between the etymon and the loanword, but it will be a purely surface measure which may or may not correlate with actual human perception; more often it does not. The Levenshtein algorithm is quite flexible and can probably be refined so as to take note of those adaptations which are phonotactically trivial but phonetically devastating to the shape of the etymon, such as vowel harmony. Should this prove impossible, another algorithm can be used or a new one can be invented to perhaps achieve a full correlation with human perception. However, the crux is that it will always have to be based on the knowledge of the languages in question. This knowledge can be obscured by using a universal algorithm which itself learns from training data (e.g. Dunning 1994; Sanders, Chin 2009) but this does not change the essential need for such knowledge in general.
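
One way such a refinement might look is a substitution cost that is lowered for adaptations known to be routine in the recipient language. The sketch below is hypothetical throughout: the function names, the table of discounted substitutions, and the weight 0.25 are invented for illustration and are not taken from the paper. It only shows that the discount table is itself a piece of linguistic knowledge that has to be supplied from outside the algorithm.

```python
import unicodedata


def segments(word):
    """One segment per NFC code point (an assumption of this sketch)."""
    return list(unicodedata.normalize("NFC", word.lower()))


def weighted_levenshtein(source, target, sub_cost, indel=1.0):
    """Edit distance in which the substitution cost depends on the segment pair."""
    previous = [j * indel for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [i * indel]
        for j, t in enumerate(target, start=1):
            current.append(min(previous[j] + indel,                # delete s
                               current[j - 1] + indel,             # insert t
                               previous[j - 1] + sub_cost(s, t)))  # match or substitute
        previous = current
    return previous[-1]


def plain_cost(s, t):
    """Basic variant: every substitution costs 1."""
    return 0.0 if s == t else 1.0


# Hypothetical discount: e > a stands in for a routine vowel-harmony repair.
# The pair and the weight 0.25 are invented, not measured.
ROUTINE_ADAPTATIONS = {("e", "a"): 0.25}


def informed_cost(s, t):
    """Substitution cost that discounts adaptations listed as routine."""
    if s == t:
        return 0.0
    return ROUTINE_ADAPTATIONS.get((s, t), 1.0)


etymon = segments("očeredь")    # stress mark omitted (assumption)
loanword = segments("uočarat")

print(weighted_levenshtein(etymon, loanword, plain_cost))     # basic costs
print(weighted_levenshtein(etymon, loanword, informed_cost))  # discounted costs
```

With the basic costs the pair očeredь > uočarat is 5 edits apart; with the invented discount the distance drops to 3.5, moving it in the direction of the low index (0.12155) quoted for this word. How large the discounts should be, and for which adaptations, cannot be read off the algorithm.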

Publication date: 2011